MTDOT: A Multilingual Translation-Based Data Augmentation Technique for Offensive Content Identification in Tamil Text Data

Abstract

The posting of offensive content in regional languages has increased as a result of the accessibility of low-cost internet and the widespread use of online social media. Despite the large number of comments available online, only a small percentage of them are offensive, resulting in an unequal distribution of offensive and non-offensive comments. Due to this class imbalance, classifiers may be biased toward the class with the most samples, i.e., the non-offensive class. To address this class imbalance, a Multilingual Translation-based Data augmentation technique for Offensive content identification in Tamil text data (MTDOT) is proposed in this work. The MTDOT method is applied to HASOC’21, which is a Tamil offensive-content dataset. To obtain a balanced dataset, each offensive comment is augmented using multi-level back translation with English and Malayalam as intermediate languages. Another balanced dataset is generated by employing single-level back translation with Malayalam, Kannada, and Telugu as intermediate languages. While both approaches are equally effective, the multi-level back-translation approach produces more diverse data, as is evident from the BLEU score. The technique proposed in this work achieved a promising improvement in F1-score of 65% over the widely used SMOTE class-balancing method.
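The two augmentation strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `translate` callable, the function names, and the ISO 639-1 codes (`ta`, `en`, `ml`, `kn`, `te`) are assumptions made for illustration, and the order of the intermediate languages in the multi-level chain is likewise assumed. A real machine-translation backend would have to be supplied in place of the toy translator.

```python
from typing import Callable, Iterable, List

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
# Any machine-translation backend could be wrapped to fit it.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, pivots: Iterable[str], translate: Translator,
                   source: str = "ta") -> str:
    """Translate `text` through each pivot language in order, then back to `source`."""
    current, current_lang = text, source
    for lang in pivots:
        current = translate(current, current_lang, lang)
        current_lang = lang
    return translate(current, current_lang, source)


def single_level(text: str, pivots: Iterable[str], translate: Translator,
                 source: str = "ta") -> List[str]:
    """One augmented copy per pivot: source -> pivot -> source."""
    return [back_translate(text, [p], translate, source) for p in pivots]


def multi_level(text: str, pivots: Iterable[str], translate: Translator,
                source: str = "ta") -> str:
    """One augmented copy via a chain: source -> pivot 1 -> pivot 2 -> ... -> source."""
    return back_translate(text, list(pivots), translate, source)


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in translator that only records the hops taken (assumption, not real MT)."""
    return f"{text}|{src}>{tgt}"


if __name__ == "__main__":
    comment = "example comment"
    print(multi_level(comment, ["en", "ml"], toy_translate))
    print(single_level(comment, ["ml", "kn", "te"], toy_translate))
```

With a real translator, each offensive comment would gain one paraphrase per chain in the multi-level setting, or one paraphrase per intermediate language in the single-level setting, which is how the minority class is enlarged toward balance.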

Similar resources

A new approach to credibility premium for zero-inflated Poisson models for panel data

The main objective of this research is to obtain and compare credibility premiums in count models with unreported claims for longitudinal data. In this research, predictive premiums are calculated based on the squared-error and exponential loss functions and compared with each other. The desire to earn a bonus is one of the important reasons for not reporting accidents, and to benefit from discounts, people often refrain from reporting low-cost accidents. In this research ...

A statistical test for outlier identification in data envelopment analysis

In the use of peer group data to assess individual, typical or best practice performance, the effective detection of outliers is critical for achieving useful results. In these ‘‘deterministic’’ frontier models, statistical theory is now mostly available. This paper deals with the statistical pared sample method and its capability of detecting outliers in data envelopment analysis. In the prese...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Investigating the integration of translation technologies into translation programs in Iranian universities: basis for a syllabus design in translation technology

Today, information technology and computers are indispensable tools of any profession, and translation technologies have become an indispensable part of the translator’s workstation. With the increasing demands for high productivity and speed as well as consistency, and with the rise of new demands for translation and localization, it is necessary for translators to be familiar with market demands an...

Data Issues of the Multilingual Translation Matrix

We describe our experiments with phrase-based machine translation for the WMT 2012 Shared Task. We trained one system for 14 translation directions between English or Czech on one side and English, Czech, German, Spanish or French on the other side. We describe a set of results with different training data sizes and subsets.

Journal

Journal title: Electronics

Year: 2022

ISSN: 2079-9292

DOI: https://doi.org/10.3390/electronics11213574